Reading in data files
Humboldt-Universität zu Berlin
2023-05-08
Last week we…
ggplot2
Today we will…
palmerpenguins
pacman package
p_load() takes package names as arguments
tidyverse loaded, and the new packages janitor and here installed and loaded
iris dataset
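As a minimal sketch of the p_load() call described above (assuming pacman itself is already installed and that missing packages can be downloaded from CRAN), the three packages named on this slide could be loaded like this:

```r
# Assumption: pacman is already installed; if not, run
# install.packages("pacman") once beforehand.

# p_load() takes unquoted package names, installs any that are
# missing, and then loads them all.
pacman::p_load(tidyverse, janitor, here)
```

A single p_load() call stands in for a separate install.packages() and library() call per package.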
Image source: Analytics Vidhya (all rights reserved)
glimpse()
tibble package
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
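Output like the above can be produced with a call along these lines. This is only a sketch: it assumes the tibble package is installed and uses the built-in iris dataset.

```r
library(tibble)

# glimpse() prints one line per column: name, type, and first values.
# It returns its input invisibly, so it can also be used inside a pipeline.
df_iris <- glimpse(iris)
```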
summary()
Exercise 1: table1
Example 1
table1
.xlsx; if you have an Excel dataset, try saving it as a .csv before reading it into R
csv is the most common data file type: Comma Separated Values
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
the first row (the “header row”) contains the column names
the subsequent rows contain the data
how many variables are there? how many observations?
readr package
penguins and iris
readr package (part of tidyverse) can load in most data types
| Student ID | Full Name | favourite.food | mealPlan | AGE |
|---|---|---|---|---|
| 1 | Sunil Huffmann | Strawberry yoghurt | Lunch only | 4 |
| 2 | Barclay Lynn | French fries | Lunch only | 5 |
| 3 | Jayendra Lyne | N/A | Breakfast and lunch | 7 |
| 4 | Leon Rossini | Anchovies | Lunch only | NA |
| 5 | Chidiegwu Dunkel | Pizza | Breakfast and lunch | five |
| 6 | Güvenç Attila | Ice cream | Lunch only | 6 |
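To illustrate how read_csv() turns comma-separated text into a table like the one above, here is a small sketch. It passes a shortened copy of the data (the first three rows only) as literal text via I(), which readr 2.0 and later supports; with a real file you would pass the file path instead. The object name df_example is only illustrative.

```r
library(readr)

# Header row plus the first three data rows of the students data,
# written as literal CSV text.
csv_text <- "Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
"

# I() marks the string as data rather than a file path.
df_example <- read_csv(I(csv_text))

names(df_example)  # column names come from the header row
nrow(df_example)   # one row per observation
```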
Exercise 2: table1
Example 2
students.csv dataset and save it as an object called df_students
df_ is short for DataFrame; it’s a good idea to use a prefix before object names so we know what each object contains
read_csv, some information is printed in the Console. What is printed?
here package
daten folder?
here::here()
here() is starting from, run here()
[1] "/Users/danielapalleschi/Documents/IdSL/Teaching/SoSe23/BA/ba_daten"
daten/students.csv)
here package
Image source: Allison Horst (all rights reserved)
here package
Before the here package, we had to explicitly tell R where on our computer a file was located (e.g., /Users/danielapalleschi/Documents/IdSL/Teaching/SoSe23/BA/ba_daten/daten/students.csv), or use the setwd() (set Working Directory) function to tell R where to assume all files are located (e.g., setwd("/Users/danielapalleschi/Documents/IdSL/Teaching/SoSe23/BA/ba_daten")). Luckily, with here you never need to use these absolute file paths or setwd()!
From the here package documentation:
The goal of the here package is to enable easy file referencing in project-oriented workflows. In contrast to using setwd(), which is fragile and dependent on the way you organize your files, here uses the top-level directory of a project to easily build paths to files.
This means we now have the huge benefit of being able to move our project folder anywhere, and our file paths will still be relative to wherever we’ve moved the project folder. This means the project runs independently of where on your computer it is located. You can also send somebody the project folder, and everything should run on their machine!
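A short sketch of building a project-relative path with here(). It assumes the here package is installed and that the project contains a daten folder with students.csv, as in the exercise; the object name path_students is only illustrative.

```r
library(here)

# here() with no arguments returns the project's top-level directory.
here()

# Further arguments are appended as path components, so this points to
# daten/students.csv inside the project, wherever the project lives.
path_students <- here("daten", "students.csv")
path_students

# Check whether the file is actually there before reading it.
file.exists(path_students)
```

The resulting path can then be handed to read_csv() in place of an absolute path.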
df_students dataframe, you might’ve noticed some NA or N/A values
N/A was written as text, and so R reads it as such
NAs in R refer to missing data (“Not Available”)
NAs) that were not plotted
N/A written in our df_students data is not actually read as a missing value
na = for the read_csv() function
read_csv() which values it should equate with missing values
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only "4"
2 2 Barclay Lynn French fries Lunch only "5"
3 3 Jayendra Lyne <NA> Breakfast and lunch "7"
4 4 Leon Rossini Anchovies Lunch only ""
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch "five"
6 6 Güvenç Attila Ice cream Lunch only "6"
NA
read_csv() reading empty cells as NA
read_csv() to read more than one type of input as NA?
"" and "N/A" as NA
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne <NA> Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
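The na = argument described above can be sketched as follows. To keep the example self-contained, the students data is passed as literal text via I() (readr 2.0 or later); with the real file you would pass its path instead.

```r
library(readr)

# The students data as literal CSV text.
csv_text <- "Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
"

# na = takes a character vector, so several strings can be treated
# as missing at once: here empty cells and the text "N/A".
df_students <- read_csv(I(csv_text), na = c("", "N/A"))

is.na(df_students$favourite.food)  # TRUE only where "N/A" was written
is.na(df_students$AGE)             # TRUE only where the cell was empty
```

Note that AGE is still read as a character column, because the value "five" cannot be parsed as a number.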
df_students in the Console we’ll see that the first two columns names are surrounded by backticks (e.g., `Student ID`)
clean_names() from the janitor package
# A tibble: 6 × 5
student_id full_name favourite_food meal_plan age
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne <NA> Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
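A small sketch of what clean_names() does to the column names. It assumes the janitor and tibble packages are installed; df_messy is an illustrative two-row stand-in for the students data, not the real file.

```r
library(tibble)
library(janitor)

# A small tibble with the same awkward column names as students.csv.
df_messy <- tibble(
  `Student ID` = c(1, 2),
  `Full Name` = c("Sunil Huffmann", "Barclay Lynn"),
  favourite.food = c("Strawberry yoghurt", "French fries"),
  mealPlan = c("Lunch only", "Lunch only"),
  AGE = c("4", "5")
)

# clean_names() converts all column names to snake_case.
df_clean <- clean_names(df_messy)
names(df_clean)
```

After cleaning, none of the names needs backticks any more.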
head(df_students), do you see the cleaned column names?
read_csv(), clean_names()) on the same object
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
magrittr package pipe: %>%
|>
%>%
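To show what the two pipes do, here is a sketch comparing a nested call with both pipe versions on the built-in iris data. It assumes the magrittr package is installed and R is version 4.1 or later (needed for the native pipe |>); the object names are only illustrative.

```r
library(magrittr)  # provides %>%

# Nested call: read from the inside out.
nested <- summary(head(iris$Sepal.Length, 10))

# magrittr pipe: read from left to right, "and then".
piped_magrittr <- iris$Sepal.Length %>% head(10) %>% summary()

# Native pipe: same idea, built into base R.
piped_native <- iris$Sepal.Length |> head(10) |> summary()

# All three versions compute the same result.
all.equal(nested, piped_magrittr)
all.equal(nested, piped_native)
```

Each pipe passes the result on its left as the first argument of the function on its right.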
Exercise 3: pipes
Example 3
students.csv dataset again with fixed NAs and then
clean_names() on the dataset, and then
head() function
students.csv dataset again with fixed NAs, saving it as the object df_students, and then
clean_names() on the data set
head() function when you’re saving the dataset as an object?
read_csv(), readr’s other functions are easy to use
read_csv2() reads semicolon-separated files
; instead of , to separate fields and are common in countries that use , as the decimal marker
read_tsv() reads tab-delimited files
read_delim() reads in files with any delimiter
delim = (e.g., read_delim("students.csv", delim = ","))
Others I haven’t yet needed:
read_fwf() reads fixed-width files
read_table() reads a common variation of fixed-width files where columns are separated by white space
read_log() reads Apache-style log files
Exercise 4: filetypes
Example 4
|”?
read_csv() and read_tsv() have in common?
nettle_1999_climate.csv
nettle_1999_climate2.csv
nettle_1999_climate3.csv
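The readr variants listed earlier (read_csv2(), read_tsv(), read_delim()) can be sketched on tiny made-up datasets. The data below is hypothetical and unrelated to the nettle files; it is passed as literal text via I() (readr 2.0 or later) so the example runs without any files.

```r
library(readr)

# Semicolon-separated, with a decimal comma: read_csv2()
semi_text <- "name;height\nA;1,75\nB;1,62\n"
df_semi <- read_csv2(I(semi_text))

# Tab-separated: read_tsv()
tab_text <- "name\theight\nA\t1.75\nB\t1.62\n"
df_tab <- read_tsv(I(tab_text))

# Any other single-character delimiter: read_delim() with delim =
pipe_text <- "name|height\nA|1.75\nB|1.62\n"
df_pipe <- read_delim(I(pipe_text), delim = "|")

# All three should contain the same numeric heights.
df_semi$height
df_tab$height
df_pipe$height
```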
tibble
tibbles are modern dataframes; don’t worry about the definition of a tibble just yet
ddmm)
tibble()
tribble()
tribble)
Exercise 6: tibbles
Example 5
df_wir
initial height month day
Length:1 Min. :171 Min. :5 Min. :7
Class :character 1st Qu.:171 1st Qu.:5 1st Qu.:7
Mode :character Median :171 Median :5 Median :7
Mean :171 Mean :5 Mean :7
3rd Qu.:171 3rd Qu.:5 3rd Qu.:7
Max. :171 Max. :5 Max. :7
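The two ways of creating a tibble by hand, tibble() and tribble(), can be sketched as follows. The column names mirror df_wir above, but the values are hypothetical placeholders, and df_cols and df_rows are illustrative names.

```r
library(tibble)

# tibble(): data entered column by column.
df_cols <- tibble(
  initial = c("A", "B"),
  height = c(171, 165),
  month = c(5, 11),
  day = c(7, 23)
)

# tribble(): data entered row by row; column names start with ~.
df_rows <- tribble(
  ~initial, ~height, ~month, ~day,
  "A",      171,     5,      7,
  "B",      165,     11,     23
)

# Both approaches describe the same table.
all.equal(df_cols, df_rows)
```

tribble() is often easier to read for small tables, because each line of code corresponds to one row of data.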
readr guesses the type of data each column contains
numerical and factor
factors contain categories or groups of data, but can sometimes look like numerical data
month contains numbers, but it could also contain the name of each month
numerical variable, but not of a factor
$: dataframe$variable
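A minimal base-R sketch of the $ operator and of the difference between a numerical variable and a factor. The data frame below is a hypothetical three-row stand-in for df_wir.

```r
# Hypothetical stand-in for df_wir.
df_wir <- data.frame(
  initial = c("A", "B", "C"),
  height = c(171, 165, 180),
  month = c(5, 11, 5)
)

# dataframe$variable extracts a single column as a vector.
df_wir$height

# A mean makes sense for a numerical variable such as height...
mean_height <- mean(df_wir$height)

# ...but month is really a category, so we store it as a factor.
df_wir$month <- factor(df_wir$month)

# For a factor, summary() counts the observations per category.
summary(df_wir$month)
```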
minimum and maximum heights in our group
sum of our heights
write_csv(object, "filename")
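A sketch of write_csv() in use. The small data frame is hypothetical, and the file is written to a temporary directory rather than the daten folder so the example does not touch your project.

```r
library(readr)

# Hypothetical example data.
df_example <- data.frame(
  initial = c("A", "B"),
  height = c(171, 165)
)

# Write to a temporary location; in a project you might use
# something like here::here("daten", "df_example.csv") instead.
path_out <- file.path(tempdir(), "df_example.csv")
write_csv(df_example, path_out)

# Read the file back in to check that it was written correctly.
df_back <- read_csv(path_out)
df_back
```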
Here are some more in-depth exercises.
starwars dataset, that contains information about Star Wars characters
.csv file called starwars in your daten folder.
starwars.csv data file using read_csv()
hair_color
skin_color
eye_color
sex
gender
homeworld
species
Produce the following three plots, and briefly describe what they show and any conclusions that can be drawn from them.
Create another plot of your choosing from the starwars dataset. Add it to the plot grid (you’ll have to adjust the syntax). Describe what it shows.
Today we…
Created with R version 4.2.3 (2023-03-15) (Shortstop Beagle) and RStudio version 2023.3.0.386 (Cherry Blossom).
R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.1.2 here_1.0.1 janitor_2.2.0 lubridate_1.9.2
[5] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.1 purrr_1.0.1
[9] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 pillar_1.9.0 compiler_4.2.3 tools_4.2.3
[5] bit_4.0.5 digest_0.6.31 timechange_0.2.0 jsonlite_1.8.4
[9] evaluate_0.20 lifecycle_1.0.3 gtable_0.3.3 pkgconfig_2.0.3
[13] rlang_1.1.0 cli_3.6.1 rstudioapi_0.14 parallel_4.2.3
[17] yaml_2.3.7 xfun_0.38 fastmap_1.1.1 withr_2.5.0
[21] knitr_1.42 generics_0.1.3 vctrs_0.6.1 hms_1.1.3
[25] bit64_4.0.5 rprojroot_2.0.3 grid_4.2.3 tidyselect_1.2.0
[29] snakecase_0.11.0 glue_1.6.2 R6_2.5.1 fansi_1.0.4
[33] vroom_1.6.1 rmarkdown_2.21 pacman_0.5.1 farver_2.1.1
[37] tzdb_0.3.0 magrittr_2.0.3 scales_1.2.1 htmltools_0.5.5
[41] colorspace_2.1-0 labeling_0.4.2 utf8_1.2.3 stringi_1.7.12
[45] munsell_0.5.0 crayon_1.5.2
Week 4 - Reading in data